Initialization

Let's load the dataset and look at its structure and variables.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

All of the variables are numeric. Quality is stored as an integer, but it could just as easily be treated as a categorical variable.
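The loading step itself is not shown in the knitted output. A minimal sketch of what it might look like, assuming the data sits in a CSV file; the file name in the usage comment and the helper name `load_wines` are assumptions, not taken from the original:

```r
# Hypothetical helper: read the wine data and print the two overviews
# shown above. The file path is an assumption.
load_wines <- function(path) {
  wines <- read.csv(path)
  print(summary(wines))  # per-column min, quartiles, mean, max
  str(wines)             # column types and the first few values
  wines
}

# Usage (the path is hypothetical):
# wines <- load_wines("wineQualityWhites.csv")
```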

Exploratory analysis of the data

Analysis of variable distributions

Let's first explore the distribution of quality, as it is our dependent variable in this research.

Distribution of wine quality

Wine quality seems to be roughly normally distributed around the median of 6. The tail is slightly heavier on the lower-quality side, with quality 5 being by far the second most common score after 6. No wines were given a 10 or a score of 0-2, and only 5 wines were of quality 9. The vast majority of wines have a quality of either 5 or 6.

Since there are so few wines with a quality under 4 or over 8, we will subset the dataset to exclude them.
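The subsetting code is not shown in the output. One way it could be written is sketched below on toy data; the helper name `subset_by_quality` is hypothetical, and the `quality_factor` column is an assumption based on the column name used in a later plot.

```r
# Keep only qualities 4 through 8 and add a factor version of quality.
# Helper name and factor column are assumptions for illustration.
subset_by_quality <- function(wines, lo = 4, hi = 8) {
  keep <- wines$quality >= lo & wines$quality <= hi
  kept <- wines[keep, , drop = FALSE]
  kept$quality_factor <- factor(kept$quality, levels = lo:hi)
  kept
}

# Toy data, for illustration only
toy <- data.frame(quality = c(3L, 4L, 5L, 6L, 6L, 8L, 9L))
table(toy$quality)                 # counts per quality score
toy_subset <- subset_by_quality(toy)
table(toy_subset$quality_factor)   # the scores 3 and 9 are gone
```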

To examine the distributions of the independent variables and spot possible outliers, let's also plot histograms of the independent variables.
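Each of the following subsections pairs a histogram with a quartile summary. The plotting code is not part of the output, so the following is only a sketch of how such a pair could be produced with ggplot2. The helper name `plot_distribution` and the bin width are assumptions.

```r
library(ggplot2)

# Hypothetical helper: quartile summary plus histogram for one column.
# `binwidth` is a guess and would need tuning per variable.
plot_distribution <- function(data, column, binwidth) {
  print(summary(data[[column]]))
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(binwidth = binwidth) +
    labs(x = column, y = "count")
}

# Toy values, for illustration only
toy <- data.frame(fixed.acidity = c(6.3, 6.8, 7.0, 7.2, 7.3, 8.1, 14.2))
p <- plot_distribution(toy, "fixed.acidity", binwidth = 0.5)
```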

Distribution of Fixed acidity

Fixed acidity seems very normally distributed, with most values falling between around 4.4 and 9.6. Let's look at the quartiles of the feature.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.851   7.300  14.200

Despite the data being quite normally distributed, there seem to be at least a few high outliers (the maximum is 14.2, against a third quartile of 7.3).

Looking at the distribution, let's cut the outliers by keeping only values less than 10.

Distribution of Volatile acidity

Volatile acidity seems to have a bit of a long tail toward higher values. Let's look at the quartiles of volatile acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2779  0.3200  1.1000

The upper quarter of the volatile acidity values lies in the long tail between 0.32 and 1.1, and there seem to be some very high outliers at the end of that tail.

Let's cut out values over 0.70.

Distribution of Citric acid

Citric acid also seems quite normally distributed (albeit with a long tail), with a few peculiar exceptions that can distort the interpretation of the data: there are large spikes in frequency at 0.5 and 0.75. The spike at 0.5 is especially curious, as it almost rivals the most frequent citric.acid levels at around 0.3. One possible explanation is that producers of more acidic wines aim for this exact amount of citric acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The quartiles look consistent with a roughly normal distribution, with the exception of a few outliers in the top quarter. The spikes are only really noticeable in the visualization.

To exclude outliers, let's include only values less than or equal to 0.75 (to make sure the high citric.acid spike gets included).

Distribution of Residual sugar

Residual sugar seems very long-tailed, with most wines not having much sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.379   9.900  31.600

As the visualization suggested, the mean (6.379) is noticeably higher than the median (5.2), and the upper quartiles span a much wider range than the first, which matches the long tail. There are also a few very large outliers.

To cut them out, let's include only values less than 20.

Distribution of Chlorides

Chlorides seem very normally distributed with a few outliers at the end of a long tail. Overall the variance of the values seems quite low.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04557 0.05000 0.34600

The quantiles seem very even, although there are a few large values.

Let's cut the large values by only including values under 0.10.

Distribution of Free sulfur dioxide

Free sulfur dioxide is quite normally distributed except the right side of the curve seems slightly less steep. There are a few outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.12   46.00  138.50

The quartiles tell the same story as the visualization. The gap between the median and the third quartile (34 to 46) is slightly wider than the gap between the first quartile and the median (23 to 34), and the maximum value of 138.50 indicates some outliers.

Let's cut those out by only including values under 80.

Distribution of Total sulfur dioxide

Total sulfur dioxide also seems quite normally distributed, with a slightly less steep curve on the right side. The variance of the values seems quite high, since the tails are not that steep.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   133.0   137.3   166.0   344.0

There also seem to be a few high values, as indicated by the maximum value.

Let's cut them out by only including values under 270.

Distribution of Density

Density values seem to have very little variance: almost all of the values fall between 0.990 and 1.000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9960  1.0020

The top quarter seems to have a small tail, but overall there seem to be no real outliers. Therefore there is no real need to subset density.

Distribution of pH

Similar to density, the pH-values have little variation with most of the values falling between 3 and 3.5. The distribution looks to be very normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.090   3.180   3.191   3.280   3.820

The quartiles seem to support what the visualization shows: there do not seem to be any real outliers in the values. Therefore subsetting pH is unnecessary.

Distribution of Sulphates

Sulphates seem to be a bit long-tailed on the right side and the values have several peaks between 0.4 and 0.6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.22    0.41    0.48    0.49    0.55    1.08

The quartiles also indicate a less steep curve in the upper half of the values, and there are a few outliers at the top end. The quartiles do not explain the peaks in the values, which can only really be seen by visualizing the distribution.

Let's cut the outliers by only accepting values under 0.9.
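The cut-offs chosen in the sections above can be collected into a single filter. This is a sketch, not the original code: the thresholds come from the text, but the helper name `apply_cutoffs` is hypothetical, and applying all cuts in one step (rather than one after another between plots) is an assumption. Density and pH are left untouched, as discussed.

```r
# Apply the cut-offs from the text in one step.
# Helper name and the single-step application are assumptions.
apply_cutoffs <- function(data) {
  keep <- with(data,
    fixed.acidity        <  10   &
    volatile.acidity     <= 0.70 &
    citric.acid          <= 0.75 &
    residual.sugar       <  20   &
    chlorides            <  0.10 &
    free.sulfur.dioxide  <  80   &
    total.sulfur.dioxide <  270  &
    sulphates            <  0.9
  )
  data[keep, , drop = FALSE]
}

# Toy rows, for illustration only: the second row fails on fixed.acidity
toy <- data.frame(
  fixed.acidity = c(7.0, 14.2), volatile.acidity = c(0.27, 0.27),
  citric.acid = c(0.36, 0.36), residual.sugar = c(1.6, 1.6),
  chlorides = c(0.045, 0.045), free.sulfur.dioxide = c(45, 45),
  total.sulfur.dioxide = c(170, 170), sulphates = c(0.45, 0.45)
)
toy_kept <- apply_cutoffs(toy)
```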

Distribution of Alcohol

Alcohol has most of its values at the low end of the curve, with higher alcohol amounts being less frequent. The decline is quite linear, with few outliers. The variance also seems to be quite high.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.40   10.54   11.40   14.20

Exploring bivariate relationships between independent variables and wine quality

After running ggpairs to view the relationships between features, the plots and the correlation values seem to indicate poor correlation between the independent variables and wine quality. This seems quite understandable, as it is difficult to imagine linear relationships between wine quality and, for example, the salt, sugar, or alcohol content of the wine, or its acidity.

The ggpairs output is not shown here, as it looks poor in the knitted HTML. To get a clearer view of the variables' relationships with wine quality, I will plot them as boxplots. Boxplots are a good way to examine the variation of the independent variables against our dependent variable, which can be interpreted as a categorical variable.

In the boxplots we visualize some promising possible independent variables vs wine quality.

Looking at the ggpairs output, fixed.acidity, volatile.acidity, residual.sugar, total.sulfur.dioxide, citric.acid, density and alcohol seemed like the most promising independent variables to affect wine quality.

In order to get a better estimate of the variance in the independent variable values, let's create boxplots of the relationships between the independent variables and wine quality.

Fixed acidity vs quality

Looking at the relationships in the boxplot, there still seem to be some outliers in the fixed acidity values. Judging by the means and the middle quartiles, the correlation with wine quality seems quite low, nonlinear and even non-monotonic: over qualities 4-7 the fixed.acidity seems to decrease, but at a quality of 8 it rises again.

Let's zoom in a bit closer to see the middle quartiles better.

In the zoomed boxplot the correlations seem to indicate the same conclusions as in the non-zoomed one.
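The boxplot code is not included in the output. A sketch of a boxplot by quality and a zoomed version, on toy data and assuming ggplot2 and a factor column named `quality_factor`, might look like this. Whether the original used `coord_cartesian()` for the zoom is not known; it is chosen here because it leaves the boxes unchanged.

```r
library(ggplot2)

# Toy stand-in for wines_subset, for illustration only
toy <- data.frame(
  quality_factor = factor(rep(4:8, each = 4)),
  fixed.acidity  = c(7.1, 7.4, 6.9, 9.8,  6.9, 7.0, 6.6, 9.5,
                     6.8, 6.7, 6.9, 9.9,  6.6, 6.7, 6.5, 9.2,
                     6.8, 6.9, 6.6, 8.2)
)

full_plot <- ggplot(toy, aes(x = quality_factor, y = fixed.acidity)) +
  geom_boxplot()

# coord_cartesian() only changes the visible window; the quartiles are
# still computed from all rows. Setting limits with ylim() would instead
# drop out-of-range rows before the boxes are computed.
zoomed_plot <- full_plot + coord_cartesian(ylim = c(6, 8))
```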

Because the relationship looks clearly nonlinear and non-monotonic, a correlation test will likely give a misleading measure of how strongly the variables are related.

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$fixed.acidity
## S = 1.7907e+10, p-value = 7.214e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.07900588

The correlation looks very low, but the true association might be stronger, given the non-monotonic nature of the relationship.
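A small synthetic example shows why a rank correlation can understate a non-monotonic relationship. The data below is made up purely for illustration: `y` is fully determined by `x`, yet Spearman's rho over the whole range is essentially zero because the falling and rising halves cancel out.

```r
# A perfectly deterministic but U-shaped relationship (synthetic data)
x <- -10:10
y <- x^2

# Over the whole range the two halves cancel: rho is approximately 0
rho_full <- cor(x, y, method = "spearman")

# On the rising half alone the relationship is monotonic: rho is 1
rho_right <- cor(x[x >= 0], y[x >= 0], method = "spearman")
```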

Volatile acidity vs quality

In volatile acidity, the variance of values between different qualities seems higher. The relationship seems non-monotonic and nonlinear here as well.

In the zoomed plot, the volatile acidity seems to drop at first between qualities 4-6 and then stay at somewhat similar levels.

Let's run a Spearman correlation test to quantify this correlation. Spearman is used because the relationship seems to be nonlinear.

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$volatile.acidity
## S = 1.9539e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1773517

The variables seem to have a slight negative correlation, as the boxplots suggest.
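The test output above has the shape printed by `cor.test()` with `method = "spearman"`, and its `data:` line shows the two arguments. A sketch of such a call on toy data follows; `exact = FALSE` is an assumption added here so that tied values do not trigger a warning about exact p-values.

```r
# Toy stand-in for wines_subset, for illustration only
toy <- data.frame(
  quality          = c(4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L),
  volatile.acidity = c(0.45, 0.38, 0.33, 0.30, 0.27,
                       0.26, 0.25, 0.27, 0.26, 0.28)
)

# as.numeric() mirrors the call shown in the output's "data:" line
result <- cor.test(as.numeric(toy$quality), toy$volatile.acidity,
                   method = "spearman", exact = FALSE)
result$estimate   # Spearman's rho
result$p.value
```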

Residual sugar vs quality

Residual sugar seems to have fewer outliers in the boxplot than the previous variables. The variance between qualities also seems nonexistent in the first quartile of the data and then rises significantly in the upper quartiles. This follows the pronounced long-tailed pattern seen in the distribution of residual.sugar.

Around the mean, the relationship between quality and residual.sugar seems to follow a strongly non-monotonic pattern. Low-quality wines seem to be less sweet than the average wines, and the sweetness starts to drop again as we move to higher-quality wines (although 8-quality wines seem slightly sweeter than the 7-quality ones).

Besides the non-monotonic pattern, there seems to be significant variation in residual.sugar values across qualities. But due to the non-monotonic nature of the relationship, a correlation test may give slightly misleading results, and hence I will trust the visualization on the nature of the relationship.

Let's do the correlation test anyway (using Spearman again):

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$residual.sugar
## S = 1.7922e+10, p-value = 5.065e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.07993064

The correlation value seems really low, which I feel does not tell the whole story on the relationship between residual.sugar and quality.

Total sulfur dioxide vs quality

There seem to be a few rather large outliers here, and once again a nonlinear and non-monotonic relationship can be seen in the data. Total sulfur dioxide starts lower for lower-quality wines, rises a bit for average-quality wines, and then falls again as the quality increases, reaching its minimum at quality 7 and remaining at a similar level at quality 8.

Let's look at the Spearman correlation:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$total.sulfur.dioxide
## S = 1.9774e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1915279

Despite the non-monotonic relationship of the variables, the correlation test shows a slight negative correlation. If we ignore the rising trend between qualities 4 and 5, the relationship does look quite clearly like a monotonic, nonlinear, negative correlation.

Citric acid vs quality

Looking at the plot, there seem to be quite a few values outside the middle quartiles. The middle quartiles, though, seem to have quite low variance, especially at higher qualities. The relationship with quality also seems quite monotonic.

Zooming into the data, we can definitely see some nonlinear positive correlation here.

Let's look at the correlation with a Spearman test:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$citric.acid
## S = 1.6225e+10, p-value = 0.1288
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02231445

The test indicates that the positive correlation is very low (rho of about 0.02) and, with a p-value of 0.13, not statistically significant. This could, in part, be explained by the outliers in the dataset and the high variance of citric.acid in the outer quartiles.

Density vs quality

The relationship, once again, seems non-monotonic, but there is definite variation between different wine qualities, which suggests that there is some sort of relationship between density and quality.

Let's try a Spearman test:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$density
## S = 2.2324e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.3451655

Despite the non-monotonicity, there seems to be a definite negative correlation between the variables according to the Spearman test.

Alcohol vs quality

Once again, there seems to be a clear non-monotonic relationship between the variables. Interestingly enough, the pattern looks very similar to that of the other independent variables. The correlation seems positive here, though.

Let's try a Spearman test:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and wines_subset$alcohol
## S = 9258400000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4421265

Despite the non-monotonicity, the positive correlation is clearly there.

Trying out curious relationships between independent variables

Next, let's try to find interesting relationships between some of the independent variables. Some of the relationships of interest might include alcohol vs density, residual.sugar vs alcohol, citric.acid vs volatile.acidity, and citric.acid vs residual.sugar.

Alcohol vs density

There seems to be some negative correlation between alcohol and density: as the density rises, the amount of alcohol decreases.

Let's see if any patterns emerge if we color the plot by quality.

There seems to be a slight pattern that high-quality wines have both a high amount of alcohol and a low density. This agrees with what the boxplots and correlations of alcohol vs quality and density vs quality showed.

Let's see how well the ratio of alcohol and density correlates with quality:

The pattern looks very similar to that of alcohol. This is due to the fact that density has very little variation between different qualities when compared to alcohol.
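A rough back-of-the-envelope check supports this, using the overall quartile summaries printed earlier as a proxy (the per-quality figures are not shown in the output). The helper `rel_spread` is a hypothetical name; it expresses the range as a fraction of the median.

```r
# Summary values reported earlier in this document
density_rng <- c(min = 0.9871, median = 0.9937, max = 1.0020)
alcohol_rng <- c(min = 8.40,   median = 10.40,  max = 14.20)

# Range as a fraction of the median: a rough, unit-free measure of spread
rel_spread <- function(s) unname((s["max"] - s["min"]) / s["median"])

density_spread <- rel_spread(density_rng)  # about 0.015, roughly 1.5%
alcohol_spread <- rel_spread(alcohol_rng)  # about 0.56, roughly 56%
```

Dividing alcohol by a quantity that moves by only about 1.5% changes its values, and therefore its ranks, very little.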

Residual sugar vs alcohol

It is difficult to see any patterns here, so let's add some alpha to the plot:

It is difficult to find any relationships in this chart. Most of the values are on the left side of the chart, because a large number of wines had a low sugar amount (as seen in the long-tailed pattern of the residual.sugar histogram). Additionally, sweeter wines seem to have a lower alcohol amount, which is curious.
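For reference, transparency can be added to a ggplot2 scatterplot through the `alpha` argument of `geom_point()`. The sketch below uses toy data, and the value `1/5` is an assumption, since the original setting is not shown.

```r
library(ggplot2)

# Toy stand-in for wines_subset, for illustration only
toy <- data.frame(
  residual.sugar = c(1.2, 1.5, 1.6, 1.6, 2.0, 7.0, 8.5, 12.0, 14.5, 18.0),
  alcohol        = c(12.5, 11.0, 9.5, 11.8, 10.4, 9.6, 9.9, 9.2, 9.0, 8.8)
)

# alpha = 1/5 means five overlapping points are needed for full opacity,
# so dense regions show up darker than sparse ones.
sugar_alcohol_plot <- ggplot(toy, aes(x = residual.sugar, y = alcohol)) +
  geom_point(alpha = 1/5)
```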

Let's see if any patterns emerge when coloring the scatterplot by quality.

Looking at the plot, there seems to be a slight pattern of qualities being higher at the high-alcohol, low-sugar end of the plot. This, however, might not indicate a meaningful relationship. Instead, the combination of the positive correlation of alcohol vs quality and the dataset having mostly low-sugar wines could explain this phenomenon.

Let's look at the ratio of alcohol to residual sugar vs quality in a boxplot.

There does seem to be some variance between the ratio values at different qualities. The relationship is clearly nonlinear and also non-monotonic.

Let's test the correlation using Spearman:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and (wines_subset$alcohol/wines_subset$residual.sugar)
## S = 1.4476e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1277507

There seems to be a small positive correlation in the data. The coefficient is unreliable, however, as the relationship is clearly non-monotonic, and Spearman's rho only measures monotonic association.

Citric acid vs volatile acidity

According to the plot, there don't really seem to be any meaningful relationships between the variables.

Let's see if anything interesting comes up when coloring the plot by quality:

# Scatterplot of citric acid vs volatile acidity, colored by quality
library(ggplot2)
ggplot(data = wines_subset,
       aes(x = citric.acid, y = volatile.acidity, color = quality_factor)) +
  geom_point()

Now this might be interesting. In the middle of the cluster, there seems to be a high concentration of higher quality wines. Could there be a relationship here?

Let's see if there is a correlation between the ratio of citric.acid to volatile.acidity and quality.

There seems to be a slight positive correlation in the boxplot.

Let's do a Spearman test:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and (wines_subset$citric.acid/wines_subset$volatile.acidity)
## S = 1.4628e+10, p-value = 5.604e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1185538

The test agrees with what the boxplot already told us: there does seem to be a small positive correlation.

Citric acid vs residual sugar

This plot is also difficult to follow, as the residual sugar values are so heavily skewed towards low amounts of sugar. No clear patterns emerge from the plot, except perhaps that low citric.acid values seem to occur only in wines with low amounts of sugar. But that may also be because the vast majority of wines are low in sugar.

Let's see if we can find anything by coloring the plot by quality.

There don't seem to be any visible patterns in the scatterplot.

Let's see if there is a correlation between the ratio of citric.acid to residual.sugar and quality.

The patterns here seem very small. Let's zoom the plot a bit to see more.

The correlation is tiny, but it seems to be there. The relationship looks nonlinear and non-monotonic, forming a kind of wave pattern in its relationship with quality.

Let's do a Spearman test:

## 
##  Spearman's rank correlation rho
## 
## data:  as.numeric(wines_subset$quality) and (wines_subset$citric.acid/wines_subset$residual.sugar)
## S = 1.4804e+10, p-value = 1.682e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.107994

There seems to be a small positive correlation between the ratio of citric acid to residual sugar and quality. The correlation test is, however, unreliable, as the relationship is clearly non-monotonic.

Running a linear model to predict quality

Let's see what the correlation values for fixed.acidity, volatile.acidity, residual.sugar, total.sulfur.dioxide, citric.acid, density and alcohol were:

Note that the chart uses absolute values, as we don't really care about the direction of the correlation, only its size. Alcohol and alcohol/density look very similar in size. As explained earlier, this is because density varies so little that the ratio carries almost the same information as alcohol alone.
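Since the chart itself is not reproduced here, the ranking it shows can be rebuilt from the Spearman estimates printed in the test outputs above (rounded to four decimals). The variable names `reported_rho` and `ranked` are only illustrative.

```r
# Spearman's rho vs quality, as reported in the test outputs above
reported_rho <- c(
  fixed.acidity        = -0.0790,
  volatile.acidity     = -0.1774,
  residual.sugar       = -0.0799,
  total.sulfur.dioxide = -0.1915,
  citric.acid          =  0.0223,
  density              = -0.3452,
  alcohol              =  0.4421
)

# Rank by size, ignoring the sign
ranked <- sort(abs(reported_rho), decreasing = TRUE)
ranked

# A quick base-R bar chart of the same values:
# barplot(ranked, las = 2)
```

Alcohol and density come out on top, followed by total.sulfur.dioxide and volatile.acidity, with citric.acid last.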

Let's try to build a linear model for wine quality using the independent variables with the largest perceived correlation.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines_subset)
## m2: lm(formula = I(quality) ~ I(alcohol) + alcohol + alcohol:density, 
##     data = wines_subset)
## m3: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + alcohol:density, 
##     data = wines_subset)
## m4: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     alcohol:density, data = wines_subset)
## m5: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + alcohol:density, data = wines_subset)
## 
## ==============================================================================
##                        m1          m2          m3          m4          m5     
## ------------------------------------------------------------------------------
## (Intercept)          2.628***    1.930***  -239.231*** -236.496*** -305.231***
##                     (0.098)     (0.177)     (35.021)    (35.121)    (34.260)  
## I(alcohol)           0.310***   -2.796***    20.767***   20.609***   26.209***
##                     (0.009)     (0.656)      (3.483)     (3.487)     (3.394)  
## alcohol x density                3.193***   -20.544***  -20.386***  -25.994***
##                                 (0.674)      (3.512)     (3.515)     (3.421)  
## density                                     242.881***  240.109***  309.650***
##                                             (35.270)    (35.373)    (34.507)  
## citric.acid                                               0.106      -0.229*  
##                                                          (0.103)     (0.101)  
## volatile.acidity                                                     -2.076***
##                                                                      (0.119)  
## ------------------------------------------------------------------------------
## R-squared               0.195       0.199       0.207       0.208       0.256 
## adj. R-squared          0.195       0.199       0.207       0.207       0.255 
## sigma                   0.773       0.771       0.768       0.768       0.744 
## F                    1124.586     576.114     403.732     303.068     318.724 
## p                       0.000       0.000       0.000       0.000       0.000 
## Log-likelihood      -5384.008   -5372.809   -5349.199   -5348.667   -5202.000 
## Deviance             2770.300    2756.945    2729.001    2728.375    2561.055 
## AIC                 10774.016   10753.618   10708.398   10709.334   10417.999 
## BIC                 10793.340   10779.383   10740.604   10747.983   10463.089 
## N                    4635        4635        4635        4635        4635     
## ==============================================================================

The alcohol:density term, density, and citric.acid affected the model very little: R-squared only rises from 0.195 (m1) to 0.208 (m4). Adding volatile.acidity in m5 lifts it to 0.256.
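The step-by-step comparison in the table can be sketched with `lm()`, `update()` and `AIC()`. The data below is simulated purely for illustration (the generating coefficients are arbitrary), so the numbers will not match the table. In the table above, adding citric.acid (m3 to m4) nudges R-squared up while AIC gets slightly worse, which is the kind of disagreement this check is meant to reveal.

```r
# Simulated stand-in for wines_subset, for illustration only; the
# coefficients used to generate `quality` are arbitrary.
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.2),
  volatile.acidity = runif(n, 0.08, 0.70)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

m_a <- lm(quality ~ alcohol, data = toy)
m_b <- update(m_a, . ~ . + volatile.acidity)   # add one predictor

# R-squared never decreases when a predictor is added, so AIC (which
# penalizes extra parameters) is a useful second opinion.
r2 <- c(m_a = summary(m_a)$r.squared, m_b = summary(m_b)$r.squared)
r2
AIC(m_a, m_b)
```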

Let's add the rest of the features:

## 
## Calls:
## m6: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + alcohol:density + alcohol:residual.sugar, 
##     data = wines_subset)
## m7: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + alcohol:density + alcohol:residual.sugar + 
##     citric.acid:volatile.acidity, data = wines_subset)
## m8: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + alcohol:density + alcohol:residual.sugar + 
##     citric.acid:volatile.acidity + citric.acid:residual.sugar, 
##     data = wines_subset)
## m9: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + total.sulfur.dioxide + alcohol:density + 
##     alcohol:residual.sugar + citric.acid:volatile.acidity + citric.acid:residual.sugar, 
##     data = wines_subset)
## m10: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + total.sulfur.dioxide + fixed.acidity + 
##     alcohol:density + alcohol:residual.sugar + citric.acid:volatile.acidity + 
##     citric.acid:residual.sugar, data = wines_subset)
## m11: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + total.sulfur.dioxide + fixed.acidity + 
##     residual.sugar + alcohol:density + alcohol:residual.sugar + 
##     citric.acid:volatile.acidity + citric.acid:residual.sugar, 
##     data = wines_subset)
## m12: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + total.sulfur.dioxide + fixed.acidity + 
##     residual.sugar + sulphates + alcohol:density + alcohol:residual.sugar + 
##     citric.acid:volatile.acidity + citric.acid:residual.sugar, 
##     data = wines_subset)
## 
## ===================================================================================================================
##                                     m6          m7          m8          m9          m10         m11         m12    
## -------------------------------------------------------------------------------------------------------------------
## (Intercept)                     -167.504*** -166.448*** -180.885*** -160.624*** -165.984*** -240.560*** -228.909***
##                                  (37.333)    (37.330)    (39.480)    (39.741)    (39.890)    (66.044)    (65.867)  
## I(alcohol)                        22.809***   22.680***   24.097***   23.449***   22.791***   29.574***   31.372***
##                                   (3.387)     (3.387)     (3.615)     (3.612)     (3.638)     (6.013)     (6.002)  
## density                          172.154***  171.235***  185.782***  165.234***  170.649***  246.063***  234.368***
##                                  (37.546)    (37.542)    (39.712)    (39.979)    (40.131)    (66.661)    (66.483)  
## citric.acid                       -0.082      -0.540      -0.386      -0.329      -0.291      -0.312      -0.318   
##                                   (0.102)     (0.283)     (0.314)     (0.314)     (0.315)     (0.315)     (0.314)  
## volatile.acidity                  -2.093***   -2.542***   -2.526***   -2.496***   -2.501***   -2.541***   -2.519***
##                                   (0.118)     (0.283)     (0.284)     (0.283)     (0.283)     (0.285)     (0.284)  
## alcohol x density                -22.706***  -22.575***  -24.009***  -23.360***  -22.681***  -29.540***  -31.397***
##                                   (3.413)     (3.413)     (3.643)     (3.641)     (3.668)     (6.073)     (6.063)  
## alcohol x residual.sugar           0.005***    0.005***    0.005***    0.005***    0.005***    0.010**     0.012** 
##                                   (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.004)     (0.004)  
## citric.acid x volatile.acidity                 1.486       1.491       1.193       1.222       1.229       1.250   
##                                               (0.854)     (0.854)     (0.855)     (0.856)     (0.855)     (0.853)  
## citric.acid x residual.sugar                              -0.021      -0.020      -0.022      -0.018      -0.023   
##                                                           (0.019)     (0.019)     (0.019)     (0.019)     (0.019)  
## total.sulfur.dioxide                                                   0.001***    0.001***    0.001***    0.001***
##                                                                       (0.000)     (0.000)     (0.000)     (0.000)  
## fixed.acidity                                                                     -0.025      -0.026      -0.006   
##                                                                                   (0.016)     (0.016)     (0.017)  
## residual.sugar                                                                                -0.056      -0.062   
##                                                                                               (0.040)     (0.039)  
## sulphates                                                                                                  0.594***
##                                                                                                           (0.107)  
## -------------------------------------------------------------------------------------------------------------------
## R-squared                            0.269       0.269       0.269       0.272       0.272       0.272       0.277 
## adj. R-squared                       0.268       0.268       0.268       0.270       0.271       0.271       0.275 
## sigma                                0.738       0.737       0.737       0.736       0.736       0.736       0.734 
## F                                  283.287     243.357     213.107     191.819     172.919     157.416     147.771 
## p                                    0.000       0.000       0.000       0.000       0.000       0.000       0.000 
## Log-likelihood                   -5162.683   -5161.165   -5160.533   -5152.535   -5151.371   -5150.365   -5135.072 
## Deviance                          2517.972    2516.323    2515.637    2506.971    2505.712    2504.624    2488.152 
## AIC                              10341.365   10340.329   10341.065   10327.070   10326.741   10326.730   10298.145 
## BIC                              10392.896   10398.302   10405.479   10397.926   10404.038   10410.468   10388.324 
## N                                 4635        4635        4635        4635        4635        4635        4635     
## ===================================================================================================================

Most of the features change the model very little. The ratio features add almost nothing to the explained variance of quality. Adding sulphates improves the model slightly: R-squared rises from 0.272 to 0.277, and AIC falls from 10326.7 to 10298.1.

Overall, the proposed model explains 27.7% of the variation in quality, which is quite poor.

Let's see what went wrong:

## 
## Call:
## lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid + 
##     volatile.acidity + total.sulfur.dioxide + fixed.acidity + 
##     residual.sugar + sulphates + alcohol:density + alcohol:residual.sugar + 
##     citric.acid:volatile.acidity + citric.acid:residual.sugar, 
##     data = wines_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.94009 -0.53649 -0.02716  0.42322  2.57985 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.289e+02  6.587e+01  -3.475 0.000515 ***
## I(alcohol)                    3.137e+01  6.002e+00   5.226 1.80e-07 ***
## alcohol                              NA         NA      NA       NA    
## density                       2.344e+02  6.648e+01   3.525 0.000427 ***
## citric.acid                  -3.175e-01  3.144e-01  -1.010 0.312579    
## volatile.acidity             -2.519e+00  2.839e-01  -8.873  < 2e-16 ***
## total.sulfur.dioxide          1.071e-03  3.234e-04   3.311 0.000936 ***
## fixed.acidity                -6.439e-03  1.661e-02  -0.388 0.698281    
## residual.sugar               -6.191e-02  3.950e-02  -1.567 0.117106    
## sulphates                     5.943e-01  1.074e-01   5.532 3.35e-08 ***
## alcohol:density              -3.140e+01  6.063e+00  -5.178 2.34e-07 ***
## alcohol:residual.sugar        1.207e-02  3.746e-03   3.222 0.001282 ** 
## citric.acid:volatile.acidity  1.250e+00  8.527e-01   1.466 0.142738    
## citric.acid:residual.sugar   -2.258e-02  1.910e-02  -1.182 0.237269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7337 on 4622 degrees of freedom
## Multiple R-squared:  0.2773, Adjusted R-squared:  0.2754 
## F-statistic: 147.8 on 12 and 4622 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: I(quality)
##                                Df  Sum Sq Mean Sq   F value    Pr(>F)    
## I(alcohol)                      1  672.45  672.45 1249.1376 < 2.2e-16 ***
## density                         1   21.13   21.13   39.2511 4.065e-10 ***
## citric.acid                     1    0.97    0.97    1.8060 0.1790508    
## volatile.acidity                1  155.20  155.20  288.3006 < 2.2e-16 ***
## total.sulfur.dioxide            1    6.35    6.35   11.8021 0.0005969 ***
## fixed.acidity                   1   23.68   23.68   43.9936 3.673e-11 ***
## residual.sugar                  1   42.14   42.14   78.2839 < 2.2e-16 ***
## sulphates                       1   15.47   15.47   28.7366 8.696e-08 ***
## alcohol:density                 1    9.82    9.82   18.2509 1.975e-05 ***
## alcohol:residual.sugar          1    5.48    5.48   10.1731 0.0014346 ** 
## citric.acid:volatile.acidity    1    1.14    1.14    2.1249 0.1449932    
## citric.acid:residual.sugar      1    0.75    0.75    1.3971 0.2372692    
## Residuals                    4622 2488.15    0.54                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
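
As a quick sanity check, the R-squared reported by the model summary can be recovered from the sums of squares in the ANOVA table. The sketch below only retypes the numbers printed above; the vector names are shorthand chosen here, not identifiers from the original analysis.

```r
# Sequential sums of squares copied from the ANOVA table above
ss_terms <- c(
  alcohol                = 672.45,
  density                = 21.13,
  citric.acid            = 0.97,
  volatile.acidity       = 155.20,
  total.sulfur.dioxide   = 6.35,
  fixed.acidity          = 23.68,
  residual.sugar         = 42.14,
  sulphates              = 15.47,
  alcohol_density        = 9.82,
  alcohol_residual.sugar = 5.48,
  citric_volatile        = 1.14,
  citric_residual.sugar  = 0.75
)
ss_resid <- 2488.15

# R-squared = explained sum of squares / total sum of squares
ss_total  <- sum(ss_terms) + ss_resid
r_squared <- sum(ss_terms) / ss_total
round(r_squared, 4)   # agrees with the Multiple R-squared of 0.2773

# Share of the total variation attributed to each term, in order of entry
round(ss_terms / ss_total, 4)
```

Note that `anova()` on a single `lm` fit reports sequential sums of squares, so the share attributed to each term depends on the order in which the terms enter the formula.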

The p-values (Pr(>|t|)) are quite high for citric.acid, citric.acid:volatile.acidity, and citric.acid:residual.sugar, which suggests that these terms have no significant effect on wine quality. Since these are exactly the terms that involve citric.acid, I conclude that citric.acid has no detectable effect on quality in this model.
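
Individual t-tests look at one coefficient at a time. One way to check whether a group of terms matters jointly is a partial F-test between nested models. The following is a minimal sketch on synthetic data: the `toy` data frame and its coefficients are invented for illustration and are not the wine data, and this test was not part of the original analysis.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for wines_subset (NOT the real data)
toy <- data.frame(
  alcohol          = rnorm(n, 10.5, 1.2),
  volatile.acidity = rnorm(n, 0.28, 0.10),
  citric.acid      = rnorm(n, 0.33, 0.12),
  residual.sugar   = rexp(n, 1 / 6)
)
# By construction, quality depends on alcohol and volatile.acidity only
toy$quality <- 3 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Full model: includes the three citric.acid terms
full <- lm(quality ~ alcohol + volatile.acidity + residual.sugar +
             citric.acid + citric.acid:volatile.acidity +
             citric.acid:residual.sugar, data = toy)

# Reduced model: the same model without any citric.acid term
reduced <- lm(quality ~ alcohol + volatile.acidity + residual.sugar,
              data = toy)

# Partial F-test: do the three citric.acid terms jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of this table means the data give no evidence that the dropped terms improve the fit, which is a weaker statement than showing that they have no effect.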

Let's remove the citric.acid features and create the final linear model:

## 
## Call:
## lm(formula = I(quality) ~ I(alcohol) + density + volatile.acidity + 
##     alcohol + total.sulfur.dioxide + fixed.acidity + residual.sugar + 
##     sulphates + alcohol:residual.sugar + density:alcohol, data = wines_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9512 -0.5355 -0.0213  0.4228  2.5903 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.194e+02  6.556e+01  -3.346 0.000827 ***
## I(alcohol)              3.052e+01  5.968e+00   5.114 3.27e-07 ***
## density                 2.247e+02  6.618e+01   3.395 0.000692 ***
## volatile.acidity       -2.146e+00  1.195e-01 -17.961  < 2e-16 ***
## alcohol                        NA         NA      NA       NA    
## total.sulfur.dioxide    1.089e-03  3.205e-04   3.396 0.000689 ***
## fixed.acidity          -9.185e-03  1.610e-02  -0.570 0.568403    
## residual.sugar         -6.891e-02  3.915e-02  -1.760 0.078484 .  
## sulphates               5.831e-01  1.072e-01   5.439 5.63e-08 ***
## alcohol:residual.sugar  1.208e-02  3.741e-03   3.230 0.001247 ** 
## density:alcohol        -3.054e+01  6.028e+00  -5.066 4.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7338 on 4625 degrees of freedom
## Multiple R-squared:  0.2766, Adjusted R-squared:  0.2752 
## F-statistic: 196.5 on 9 and 4625 DF,  p-value: < 2.2e-16
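
The note "1 not defined because of singularities" and the NA row for alcohol appear because I(alcohol) and alcohol produce the same column in the design matrix, so R can estimate only one of them. The sketch below reproduces this behaviour on synthetic data (the `toy` data frame is invented for illustration, not the wine data) and shows that dropping the duplicate term leaves the fitted values unchanged.

```r
set.seed(1)

# Synthetic data for illustration only (NOT the wine data)
toy <- data.frame(alcohol = rnorm(200, 10.5, 1.2),
                  density = rnorm(200, 0.994, 0.003))
toy$quality <- 3 + 0.3 * toy$alcohol + rnorm(200, sd = 0.7)

# I(alcohol) and alcohol give identical columns in the design matrix
dup <- lm(quality ~ I(alcohol) + density + alcohol, data = toy)
coef(dup)            # the `alcohol` entry is NA
alias(dup)$Complete  # reports alcohol as a copy of I(alcohol)

# Dropping the duplicate term gives the same fitted values
clean <- lm(quality ~ alcohol + density, data = toy)
all.equal(unname(fitted(dup)), unname(fitted(clean)))
```

The redundant term does not change the fit, but removing it from the formula makes the summary easier to read.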

Summary

In this research I explored what chemical qualities could affect the quality of white wine.

It seemed that the most significant factor affecting white wine quality in this dataset was, quite surprisingly, alcohol. Here is a boxplot summarizing the relationship between alcohol content and quality.

When exploring multivariate relationships, a curious relationship was found between citric.acid, volatile.acidity and quality.

In this plot, the good-quality wines seemed to cluster around a certain area. Even so, the regression analysis found no statistically significant effect of citric.acid on wine quality.

Finally, the correlation of each explored feature with quality is displayed in this barplot:

In the plot, the bars are colored by the direction of the correlation, and sorted by its size. The plot suggests that alcohol is the most important factor in wine quality.
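
A plot of this kind can be sketched as follows. This is a minimal example under stated assumptions: it uses synthetic data (the `toy` data frame is invented here, with column names that mirror the dataset) and base R graphics to stay dependency-free, which may differ from how the original figure was drawn.

```r
set.seed(7)
n <- 300

# Synthetic stand-in data (NOT the wine data)
toy <- data.frame(
  alcohol          = rnorm(n, 10.5, 1.2),
  density          = rnorm(n, 0.994, 0.003),
  volatile.acidity = rnorm(n, 0.28, 0.10),
  sulphates        = rnorm(n, 0.49, 0.11)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

features <- setdiff(names(toy), "quality")

# Pearson correlation of each feature with quality
cors <- sapply(toy[features], function(x) cor(x, toy$quality))

# Sort by absolute size, largest first
cors <- cors[order(abs(cors), decreasing = TRUE)]

# Color the bars by the sign of the correlation
barplot(cors,
        col  = ifelse(cors > 0, "steelblue", "firebrick"),
        las  = 2,
        ylab = "Correlation with quality")
```

Sorting by absolute value keeps strong negative correlations next to strong positive ones, so the color carries the direction while the bar length carries the strength.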

Reflection

In this research I explored what chemical qualities could affect the quality of white wine.

First, the features were plotted as histograms to detect outliers and to discover interesting patterns in their distributions. Then, the correlation of every feature with quality was investigated using ggpairs. After that, a subset of the most promising independent variables was chosen for closer examination. The chosen independent variables were plotted and tested for correlation with quality. Some of the independent variables were plotted against each other to discover interesting patterns and, where possible, to create additional features from their ratios. Finally, a linear model was built to see how well the chemical properties could explain wine quality using a simple model. As it turns out, not too well.

The features provided seem ill-suited to explain wine quality. This is understandable, however, as human opinions and decisions (such as how people rate wine) are notoriously difficult to explain with a handful of variables. The relationships between the variables were very nonlinear and non-modal, which suggests that different chemical properties suit different wines in ways that are difficult to estimate, at least using just the features given in this dataset.

The dataset was also limited in several ways. For example, almost all of the wines in the dataset had quality scores of 5-7. It would have been interesting if the dataset had contained more data on very high and very low quality wines, along with some more features that could explain the variance in quality a bit better.

Despite the shortcomings of the linear model, the exploratory analysis provided some interesting insight into how the chemical properties relate to wine quality in this dataset. For example, alcohol seemed to be the most significant variable in defining wine quality. This hopefully doesn’t mean that the wine experts mostly look for alcohol in their wine. More probably, it reflects the fact that more mature wine often has a higher alcohol content, and more mature wine is often regarded as higher quality. It is good not to confuse correlation with causation when interpreting these results.